Dr James Bartlett and Dr Sarah Charles
Crash course in frequentist statistical inference
How do researchers often try to test the null?
Equivalence testing tutorial
Target data sets:
Feeling the Future (Bem, 2011)
Statistical Reasoning After Being Taught With R Programming Versus Hand Calculations (Ditta & Woodward, 2022)
Approach to statistical inference behind commonly used p-values
“Objective” theory where probabilities exist in the world and are there to be discovered, independent of the observer
Probability cannot be assigned to individual events, only to a collective
We can calculate the probability of data given a hypothesis
This is where p-values come in: We can calculate the probability of observing data (or more extreme), assuming the null hypothesis is true
See it as a measure of surprise:
Low probability (small p-value) = data would be surprising under the null
High probability (large p-value) = data would not be surprising under the null
We can either reject or retain the null hypothesis; we cannot accept the null
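The "measure of surprise" idea can be made concrete with a small sketch (illustrative Python, not from the talk itself): guessing correctly on 60 of 100 coin flips, tested against the chance-level null of 50%.

```python
# Illustrative sketch: the p-value as a measure of surprise under the null.
# Suppose 60 correct guesses out of 100 trials; the null is chance (p = 0.5).
from scipy import stats

result = stats.binomtest(60, n=100, p=0.5)  # two-sided by default
print(round(result.pvalue, 4))  # ≈ 0.057: mildly surprising under the null
```

With p just above .05, we would retain (not accept) the null at the conventional alpha level, even though the data lean away from chance.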
The dominant, but often unnamed, approach to hypothesis testing (Lakens, 2021)
Suitable when the null hypothesis is plausible / meaningful
Creates a decision procedure on how to act while controlling error rates: reject the null hypothesis or not?
Type I errors / false positives controlled through alpha (e.g., \(\alpha\) = .05)
Type II errors / false negatives controlled through beta (e.g., \(\beta\) = .20)
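The claim that alpha controls the false-positive rate can be checked by simulation; a minimal sketch (illustrative Python, not from the talk), drawing both groups from the same population so the null is true by construction:

```python
# Minimal simulation: when the null is true, rejecting at alpha = .05
# produces false positives about 5% of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n_sims, n = 0.05, 2000, 30
false_positives = 0
for _ in range(n_sims):
    a = rng.normal(0, 1, n)  # both groups drawn from the same population,
    b = rng.normal(0, 1, n)  # so the null hypothesis is true by construction
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1
print(false_positives / n_sims)  # close to alpha
```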
Figure from Lakens et al. (2018)
Important to keep in mind what p-values can and cannot do (Wasserstein & Lazar, 2016)
p-values can indicate how incompatible the data are with a specified statistical model
p-values do not tell you the probability your alternative hypothesis is true
p-values do not measure the size of an effect or the importance of a result
Scientific conclusions should not solely be based on whether a p-value passes a given alpha threshold or not
Bem (2011) published an infamous series of studies purporting to show precognition (psychic abilities)
100 participants (study 1) saw two hidden windows: one empty and one containing an erotic or non-erotic image
Participants had to guess which window contained the image, where 0% means never correct, 50% chance level (a coin flip), and 100% correct on every guess
What success rate (%) would convince you someone had psychic abilities?
For non-erotic images, participants’ hit rate was not significantly higher than chance, t(99) = -0.15, p = 0.884
However, for erotic images, participants’ hit rate was significantly higher than chance, t(99) = 2.51, p = 0.014
So, maybe we do have evidence for precognition (at least for predicting the future position of erotic images…), but what about the effect size?
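The shape of Bem's focal test is a one-sample t-test of hit rates (%) against the 50% chance level; a sketch in Python using simulated hit rates for illustration (these are not Bem's actual data):

```python
# Sketch of the kind of test Bem (2011) reports: one-sample t-test of
# hit rates (%) against the 50% chance level. Data are simulated for
# illustration only, not Bem's actual data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2011)
hit_rates = rng.normal(loc=53, scale=12, size=100)  # hypothetical 100 participants
res = stats.ttest_1samp(hit_rates, popmean=50)
print(f"t({len(hit_rates) - 1}) = {res.statistic:.2f}, p = {res.pvalue:.3f}")
```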
The difference between a significant and non-significant result may not represent a meaningful shift (Interaction fallacy; Gelman & Stern, 2006)
Even when a result is statistically significant, the effect size might be entirely meaningless (Meehl’s paradox; Kruschke & Liddell, 2018)
It is important to keep in mind whether the null hypothesis is plausible / meaningful for your study (Crud factor; Orben & Lakens, 2020)
Is there no meaningful difference between two competing interventions?
Does your theory rule out specific effects?
Is your correlation too small to be meaningful?
Inferences in Psychology Teaching and Learning: A Review of Statistics Misconceptions
Unfortunately, little progress in reviewing 76 articles…
Our RQ: Can studies in psychology teaching and learning meet their inferential goals?
What is the prevalence of misconceptions in interpreting non-significant results in psychology teaching and learning?
How do studies in psychology teaching and learning justify their sample sizes?
No statistical approach can directly support the null hypothesis of exactly 0
Equivalence testing is one approach and originates from drug development research
Equivalence testing flips NHST logic and uses two one-sided t-tests to test your effect against two boundaries:
Is your effect significantly larger than a lower bound?
Is your effect significantly smaller than an upper bound?
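The two one-sided tests can be sketched as a small function; illustrative Python rather than the R TOSTER package the talk uses, with toy numbers for the example:

```python
# Sketch of the TOST logic (the talk itself uses the R TOSTER package):
# two one-sided t-tests of an effect estimate against a lower and an
# upper equivalence bound.
from scipy import stats

def tost(estimate, se, df, low, high):
    """Return the two one-sided p-values; equivalence is claimed only
    if BOTH are below alpha, i.e., max(p_lower, p_upper) < alpha."""
    t_lower = (estimate - low) / se   # is the effect above the lower bound?
    t_upper = (estimate - high) / se  # is the effect below the upper bound?
    p_lower = stats.t.sf(t_lower, df)   # one-sided: P(T > t_lower)
    p_upper = stats.t.cdf(t_upper, df)  # one-sided: P(T < t_upper)
    return p_lower, p_upper

# Toy example: estimate of 1.0 (SE = 2.0, df = 100) against bounds of ±5
p_lo, p_hi = tost(1.0, 2.0, 100, low=-5, high=5)
print(max(p_lo, p_hi) < 0.05)  # True: the effect is statistically equivalent
```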
Figure from Lakens et al. (2018)
Figure from Lakens (2017)
Alpha
Equivalence bounds
Sample size
Flexible R package, TOSTER (Lakens & Caldwell), that can apply equivalence or interval testing to focal tests:
T-tests
Correlations
Meta-analysis
Non-parametric tests
Technology or Tradition? A Comparison of Students’ Statistical Reasoning After Being Taught With R Programming Versus Hand Calculations (Ditta & Woodward, 2022)
Compared conceptual understanding of statistics at the end of a 10-week intro course
Students completed one of two versions:
Formula-based approach to statistical tests (n = 57)
R code approach to statistical tests (n = 60)
Research question (RQ): Does learning through hand calculations or R code lead to greater conceptual understanding of statistics?
Between-subjects IV: Formula-based or R code approach course
DV: Final exam (conceptual understanding questions) score as proportion correct (%)
Welch Two Sample t-test
data: e3total by condition
t = -1.117, df = 110.97, p-value = 0.2664
alternative hypothesis: true difference in means between group HC and group R is not equal to 0
95 percent confidence interval:
-7.584355 2.116173
sample estimates:
mean in group HC mean in group R
69.29091 72.02500
The traditional t-test was non-significant, but does that mean there was no meaningful difference?
We can apply an equivalence test using bounds of ±10% for our smallest effect size of interest
Welch Modified Two-Sample t-Test
Hypothesis Tested: Equivalence
Equivalence Bounds (raw): -10.000 & 10.000
Alpha Level: 0.05
The equivalence test was significant, t(110.97) = 2.968, p = .002
The null hypothesis test was non-significant, t(110.97) = -1.117, p = .266
NHST: don't reject null significance hypothesis that the effect is equal to zero
TOST: reject null equivalence hypothesis
TOST Results
t SE df p.value
t-test -1.117011 2.447684 110.97 2.664022e-01
TOST Lower 2.968483 2.447684 110.97 1.833635e-03
TOST Upper -5.202506 2.447684 110.97 4.542552e-07
Effect Sizes
estimate SE lower.ci upper.ci conf.level
Raw -2.7340909 2.447684 -6.7940673 1.3258855 0.9
Hedges' g(av) -0.2073357 0.188892 -0.5208061 0.1001411 0.9
Note: SMD confidence intervals are an approximation. See vignette("SMD_calcs").
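The printed TOST statistics can be cross-checked from the summary values in the output above (mean difference, SE, df); illustrative Python rather than the R TOSTER package that produced the output:

```python
# Cross-checking the printed TOST output from the reported summary
# statistics; Python illustration of the same two one-sided tests.
from scipy import stats

est = 69.29091 - 72.02500   # mean(HC) - mean(R) = -2.734
se, df = 2.447684, 110.97
low, high = -10, 10         # equivalence bounds of ±10%

t_lower = (est - low) / se    # 2.968, matches "TOST Lower"
t_upper = (est - high) / se   # -5.203, matches "TOST Upper"
p_lower = stats.t.sf(t_lower, df)
p_upper = stats.t.cdf(t_upper, df)
print(round(t_lower, 3), round(t_upper, 3))
print(max(p_lower, p_upper) < 0.05)  # True: reject the null of non-equivalence
```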
Theory / subject knowledge
Small telescopes approach
Effect size benchmarks
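The small-telescopes approach (Simonsohn, 2015) sets the bound at the effect size the original design had 33% power to detect; a sketch assuming the group sizes from the Ditta & Woodward example (57 and 60), not a computation from the talk itself:

```python
# Sketch of the small-telescopes bound: the standardised effect size d
# a two-sample design had 33% power to detect. Group sizes of 57 and 60
# are taken from the Ditta & Woodward example; everything else is assumed.
from scipy import stats
from scipy.optimize import brentq

n1, n2, alpha = 57, 60, 0.05
df = n1 + n2 - 2
t_crit = stats.t.ppf(1 - alpha / 2, df)

def power(d):
    # Power of a two-sided two-sample t-test for effect size d,
    # via the noncentral t distribution
    nc = d * (n1 * n2 / (n1 + n2)) ** 0.5
    return stats.nct.sf(t_crit, df, nc) + stats.nct.cdf(-t_crit, df, nc)

d_33 = brentq(lambda d: power(d) - 1 / 3, 0.01, 2)  # solve power(d) = 33%
print(round(d_33, 2))  # the small-telescopes equivalence bound in d units
```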
Null hypothesis significance testing and p-values are suited to specific roles
If supporting the null is a desirable inference, you need techniques like equivalence testing
This allows you to conclude whether effects are statistically equivalent or not
Setting equivalence bounds is the hardest decision, and one you must transparently justify
For practical tutorials, our new PsyTeachR book includes an appendix walking through equivalence testing in R
Lakens (2023) online chapter on equivalence testing and interval hypotheses
Lakens (2017) and Lakens et al. (2018) tutorial articles on equivalence testing
Bartlett et al. (2022) example of equivalence testing in the wild
Charles et al. (2022) slightly more advanced equivalence testing in the wild
Any questions?
Dr James Bartlett
@JamesEBartlett
james.bartlett@glasgow.ac.uk
Dr Sarah Charles
@SarahCharlesNC
sarah.charles@ntu.ac.uk